
Add fast fail override for performance benchmarks #1769

Open
tadiwa-aizen wants to merge 5 commits into awslabs:main from tadiwa-aizen:fast-fail-benchmarks

Conversation

@tadiwa-aizen
Contributor

@tadiwa-aizen tadiwa-aizen commented Feb 18, 2026

What changed and why?
Added configurable fail-fast mode for benchmark sweeps to enable early termination on failures.

Previously, benchmark sweeps would run all parameter combinations regardless of failures, making it difficult to catch configuration errors early in long-running sweeps. This change adds a fail_fast parameter to the smart benchmark sweeper that, when enabled, runs benchmarks sequentially and stops immediately upon the first failure. This allows quick identification of issues during development without wasting resources on remaining benchmarks.

Implementation details:

  • When fail_fast=true: Benchmarks run one at a time (batch_size=1), checking status after each job
  • When fail_fast=false (default): All benchmarks run in one batch, maintaining existing behavior
  • On failure with fail_fast=true: Re-raises the original exception with full context (stack trace, error message, failed configuration)
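The three bullets above can be sketched as a standalone reconstruction (hypothetical helper names; the real sweeper hands batches to Hydra's launcher and inspects JobReturn.return_value):

```python
# Illustrative sketch of the fail_fast batching logic, under assumed semantics:
# JobResult stands in for Hydra's JobReturn, whose return_value property
# re-raises the job's exception if the job failed.
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Sequence


@dataclass
class JobResult:
    """Stand-in for Hydra's JobReturn."""
    value: Any = None
    error: Optional[Exception] = None

    @property
    def return_value(self) -> Any:
        if self.error is not None:
            raise self.error  # accessing return_value raises if the job failed
        return self.value


def execute_batches(
    all_combinations: Sequence[Any],
    launch: Callable[[Sequence[Any]], List[JobResult]],
    fail_fast: bool = False,
) -> List[List[JobResult]]:
    returns: List[List[JobResult]] = []
    if not all_combinations:
        return returns
    # fail_fast=True -> batch_size=1: launch one job, inspect it, then continue
    batch_size = 1 if fail_fast else len(all_combinations)
    for i in range(0, len(all_combinations), batch_size):
        batch = all_combinations[i : i + batch_size]
        results = launch(batch)
        returns.append(results)
        if fail_fast:
            for r in results:
                _ = r.return_value  # re-raises the original exception, stopping the sweep
    return returns
```

With fail_fast=False the launcher is called once with the whole sweep; with fail_fast=True a failure surfaces right after the failing job and the remaining combinations are never launched.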

Does this change impact existing behavior?

No breaking changes. Default behavior (fail_fast=false) maintains existing functionality where all benchmarks run regardless of failures.

Testing

Unit tests in tests/test_execute_batches.py:

  • test_fail_fast_true_stops_on_first_failure: Verifies that with fail_fast=true, the sweeper stops after the first failed job and doesn't execute remaining benchmarks (validates call count = 2 for 3 jobs when 2nd fails).

  • test_fail_fast_false_continues_through_failures: Confirms that with fail_fast=false, all benchmarks run in a single batch and failures are captured without stopping execution.
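A self-contained sketch of the first test's shape (illustrative only, not the actual tests/test_execute_batches.py; FakeResult, run_sweep, and the job names are stand-ins):

```python
# Hedged sketch of test_fail_fast_true_stops_on_first_failure: a mocked
# launcher whose 2nd job fails should be called exactly twice for 3 jobs.
from unittest.mock import Mock


class FakeResult:
    """Stand-in for a launcher result whose return_value raises on failure."""

    def __init__(self, error=None):
        self.error = error

    @property
    def return_value(self):
        if self.error is not None:
            raise self.error
        return "ok"


def run_sweep(launcher, jobs, fail_fast):
    """Minimal stand-in for the sweeper's batching loop."""
    batch_size = 1 if fail_fast else len(jobs)
    for i in range(0, len(jobs), batch_size):
        results = launcher.launch(jobs[i : i + batch_size])
        if fail_fast:
            for r in results:
                _ = r.return_value


def test_fail_fast_true_stops_on_first_failure():
    launcher = Mock()
    # 1st job succeeds, 2nd fails, 3rd would succeed if it were ever launched
    launcher.launch.side_effect = [
        [FakeResult()],
        [FakeResult(error=RuntimeError("job 2 failed"))],
        [FakeResult()],
    ]
    try:
        run_sweep(launcher, ["j1", "j2", "j3"], fail_fast=True)
        raise AssertionError("expected RuntimeError")
    except RuntimeError:
        pass
    assert launcher.launch.call_count == 2  # 3rd job never launched
```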

Run tests with:

cd mountpoint-s3/benchmark
uv run pytest tests/test_fail_fast.py -v

Does this change need a changelog entry? Does it require a version change?

No - this only affects internal benchmark tooling, not the Mountpoint binary or published crates.


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and I agree to the terms of the Developer Certificate of Origin (DCO).

Signed-off-by: Tadiwa Magwenzi <tadiwaom@amazon.com>
@tadiwa-aizen tadiwa-aizen marked this pull request as ready for review February 19, 2026 19:23
Signed-off-by: Tadiwa Magwenzi <tadiwaom@amazon.com>
# Accessing return_value raises an exception if the job failed (hydra/core/utils.py:251-258)
if self.fail_fast:
    for r in results:
        _ = r.return_value  # Raises on failure, stopping the sweep
Contributor

What happens if a job fails? Does the exception message or logs get captured clearly so users know which benchmark failed and why?

_target_: str = "hydra_plugins.smart_sweeper.smart_benchmark_sweeper.SmartBenchmarkSweeper"
max_batch_size: Optional[int] = None
params: Optional[Dict[str, str]] = None
fail_fast: bool = False
Contributor

Is this worth documenting somewhere? Users should know when and why they'd want to enable fail_fast=true. Consider this if it feels reasonable.

returns.append(results)

# Determine batch size: run all at once (fail_fast=False) or one at a time (fail_fast=True)
batch_size = 1 if self.fail_fast else len(all_combinations)
Contributor

Why should fail_fast impact the batch size? Wouldn't it be faster to run in parallel but cancel on failure?

Contributor Author

Our SmartBenchmarkSweeper uses Hydra's BasicLauncher by default, which executes jobs sequentially, not in parallel. With fail_fast=True and batch_size=1, the sweeper passes one job at a time to the launcher, gets the result back immediately, and can check whether it failed before launching the next job. If we used batch_size=len(all_combinations), the launcher would run all jobs sequentially and only return after completing every single one, so we'd waste time running jobs after a failure instead of stopping early. This fits our requirements.

(On parallel jobs: Hydra does support parallel launchers like JoblibLauncher, but they don't support canceling in-flight jobs on first failure; they run all submitted jobs to completion. I'm also not sure what effect parallel execution might have on the benchmark results in general.)

# Accessing return_value raises an exception if the job failed (hydra/core/utils.py:251-258)
if self.fail_fast:
    for r in results:
        _ = r.return_value  # Raises on failure, stopping the sweep
Contributor

Does accessing return_value raise on failure?

for r in results:
    _ = r.return_value  # Raises on failure, stopping the sweep

initial_job_idx += len(batch)
Contributor

Is initial_job_idx the same as i?

# Determine batch size: run all at once (fail_fast=False) or one at a time (fail_fast=True)
batch_size = 1 if self.fail_fast else len(all_combinations)

for i in range(0, len(all_combinations), batch_size):
Contributor

There should be a test for this.

"""

# Create mock launcher and setup
mock_launcher = Mock()
Contributor

Extract to function build_smart_sweeper?

sweeper_normal._extract_benchmark_types = Mock(return_value=['fio'])

# Test the batching logic in the sweeper
all_combinations = combinations
Contributor

Why rename the variable?


# Test the batching logic in the sweeper
all_combinations = combinations
batch_size = 1 if sweeper_normal.fail_fast else len(all_combinations)
Contributor

This isn't testing anything useful.


# Test with fail_fast=True (should batch one at a time)
sweeper_fast_fail = SmartBenchmarkSweeper(fail_fast=True)
batch_size_fast_fail = 1 if sweeper_fast_fail.fail_fast else len(all_combinations)
Contributor

Similarly, this isn't testing anything useful.

results = [mock_result_success]

# Accessing return_value on a successful job should NOT raise an exception
try:
Contributor

This is not useful.

Signed-off-by: Tadiwa Magwenzi <tadiwaom@amazon.com>
@tadiwa-aizen tadiwa-aizen requested a deployment to PR integration tests March 2, 2026 01:23 — with GitHub Actions Waiting
results = self.launcher.launch(all_combinations, initial_job_idx=initial_job_idx)
returns = self._execute_batches(all_combinations, initial_job_idx)

return returns
Contributor

Is this meant to return nothing if all_combinations is empty?

["benchmark_type=fio", "mountpoint.stub_mode=off", "network.maximum_throughput_gbps=100"],
]

def test_fail_fast_true_stops_on_first_failure(self):
Contributor

These tests look better, though have you verified that it actually behaves this way when called from the CLI?

Contributor Author

Yes, verified with a real CLI run. Configured a benchmark sweep with 36 total combinations and intentionally used an invalid network interface name (ens129 instead of ens33) to trigger a failure.

Results with fail_fast=true:

  • Job #0: SUCCESS (fio, single NIC)
  • Job #1: SUCCESS (fio, single NIC, direct_io=True)
  • Job #2: FAILED (fio, dual NIC with invalid ens129 interface)
  • Stopped immediately; remaining 33 jobs never executed

The framework correctly caught the CalledProcessError from the mount-s3 command failing with AWS_ERROR_INVALID_ARGUMENT, and stopped execution as expected.

assert mock_launcher.launch.call_count == 1
assert len(results) == 1
assert len(results[0]) == 3
assert results[0][1].status == JobStatus.FAILED # Verify failure is captured
Contributor

Missing verification that the other 2 cases succeeded.

mock_launcher = Mock()
sweeper.launcher = mock_launcher

# side_effect makes mock return different values on each call: 1st call gets 1st item, 2nd call gets 2nd item, etc.
Contributor

We don't need excess comments describing what the code does directly.

Signed-off-by: Tadiwa Magwenzi <tadiwaom@amazon.com>
@tadiwa-aizen tadiwa-aizen requested a deployment to PR integration tests March 3, 2026 18:24 — with GitHub Actions Waiting
@tadiwa-aizen tadiwa-aizen requested a review from muddyfish March 3, 2026 18:26